[WIP] [skip ci] Fuzz testing in Spark SQL #7625

JoshRosen · 2015-07-23T20:34:02Z

[skip ci]

This is a WIP pull request for some expression fuzz testing code that I'm working on as part of a hackathon. I'm creating this pull request now in order to share the code and to have a pull request that I can reference from my other pull requests for fixing bugs that were found using this tester.

Features on my TODO list

- Better logging to aid debuggability.
- "Continuous" mode which dumps all results to files and keeps going when errors occur (designed to run overnight).
- Validator which asserts that random queries return equivalent answers when run under different configuration modes (safe vs. unsafe vs. safe w/o codegen, plus a few other permutations).
- Plan transformer which takes valid logical query plans and transforms them into equivalent ones, then checks that both the original and transformed plans produce equivalent answers. This style of test is used in MySQL's testing tools.
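
To make the validator idea concrete, here is a minimal sketch of differential testing on a toy expression tree. None of this is the code in this PR: `Expr`, `interpret`, `compile`, and `DifferentialValidator` are hypothetical names, and the two evaluation paths only stand in for Spark's interpreted and code-generated modes. It assumes that the modes should agree on every input and that a fixed seed is enough to replay a failure.

```scala
import scala.util.Random

// Toy expression tree standing in for Catalyst expressions (hypothetical, not Spark's classes).
sealed trait Expr
case class Lit(value: Long) extends Expr
case object Input extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr

object DifferentialValidator {
  // Mode 1: walk the tree for every input (analogous to an interpreted path).
  def interpret(e: Expr, x: Long): Long = e match {
    case Lit(v)    => v
    case Input     => x
    case Add(l, r) => interpret(l, x) + interpret(r, x)
    case Mul(l, r) => interpret(l, x) * interpret(r, x)
  }

  // Mode 2: build a closure once, then apply it (analogous to a compiled path).
  def compile(e: Expr): Long => Long = e match {
    case Lit(v) => (_: Long) => v
    case Input  => (x: Long) => x
    case Add(l, r) =>
      val (cl, cr) = (compile(l), compile(r))
      (x: Long) => cl(x) + cr(x)
    case Mul(l, r) =>
      val (cl, cr) = (compile(l), compile(r))
      (x: Long) => cl(x) * cr(x)
  }

  // Generate a random expression of bounded depth from a seeded generator.
  def randomExpr(rng: Random, depth: Int): Expr =
    if (depth <= 0 || rng.nextInt(4) == 0) {
      if (rng.nextBoolean()) Input else Lit(rng.nextInt(100).toLong)
    } else if (rng.nextBoolean()) {
      Add(randomExpr(rng, depth - 1), randomExpr(rng, depth - 1))
    } else {
      Mul(randomExpr(rng, depth - 1), randomExpr(rng, depth - 1))
    }

  // Returns a description of every disagreement; empty when both modes agree.
  def validate(seed: Long, numExprs: Int, inputsPerExpr: Int): Seq[String] = {
    val rng = new Random(seed)
    val failures = Seq.newBuilder[String]
    for (_ <- 0 until numExprs) {
      val expr = randomExpr(rng, 4)
      val compiled = compile(expr)
      for (_ <- 0 until inputsPerExpr) {
        val x = rng.nextInt(1000).toLong
        val a = interpret(expr, x)
        val b = compiled(x)
        // Record the seed so that a failing case can be replayed deterministically.
        if (a != b) failures += s"seed=$seed expr=$expr input=$x interpreted=$a compiled=$b"
      }
    }
    failures.result()
  }
}
```

Calling `DifferentialValidator.validate(seed = 42L, numExprs = 50, inputsPerExpr = 20)` returns an empty sequence when the two paths agree. Note that this kind of check only detects disagreement between modes; it cannot tell which mode is wrong, and it misses bugs that both modes share.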

List of potential bugs found during this testing

Note that most of these bugs are problems in analysis error reporting and not legitimate bugs in query execution. This tool isn't really capable of finding "wrong answer" bugs yet because it lacks an oracle for determining what the proper query answers are.

(✅ indicates fixed, 🚧 indicates a fix in progress)

Analysis issues:

- The createDataFrame() methods should guard against null values being passed in (e.g. the user passes null instead of Row).
- ✅ The analyzer should check that join conditions have BooleanType: [SPARK-9292] Analysis should check that join conditions' data types are BooleanType #7630.
- ✅ The analyzer should ensure that set operations (union, intersect, and except) are only performed on tables that have the same number of columns: [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns #7631.
- ✅ Sorting based on array-typed columns should produce an error at analysis time, not at runtime: [SPARK-9295] Analysis should detect sorting on unsupported column types #7633.
- ✅ DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns: https://issues.apache.org/jira/browse/SPARK-9323
- The DATAFRAME_EAGER_ANALYSIS configuration flag does not work properly in all cases: there are still many corner cases where invalid queries will eagerly throw analysis errors.

Type mismatches in joins sometimes produce confusing errors. Let's say that we have two DataFrames with columns that have the same name, but where one column is a struct and the other is a boolean. If we try to join on a nested field, this can result in a confusing "Can't extract value" message instead of a more informative message explaining that the types are mismatched:

val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
val df2 = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": false}""" :: Nil))
df.join(df2, "a.b")

org.apache.spark.sql.AnalysisException: Can't extract value from a#26607;
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:63)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:263)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:127)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:137)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.join(DataFrame.scala:404)

Execution issues:

- Scala's BigDecimal can lead to OutOfMemoryErrors when computing rows' hashCodes (this is caused by SI-6173): [SPARK-9303] Decimal should use java.math.Decimal directly instead of using Scala wrapper #7635
- ✅ When SortMergeJoin is enabled, we may get runtime crashes when attempting to join on a struct field: [SPARK-9306][SQL] Don't use SortMergeJoin when joining on unsortable columns #7645.
- ✅ CatalystTypeConverters.toScala may throw UnsupportedOperationException when applied to an UnsafeRow: [SPARK-9368][SQL] Support get(ordinal, dataType) generic getter in UnsafeRow. #7682
- ✅ TungstenProject code generation fails for array<binary> columns.
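
As a rough illustration of the BigDecimal item above: java.math.BigDecimal stores a value as an unscaled integer plus a scale, so a value with an enormous exponent stays small in memory and can be hashed without writing out its digits. My understanding of SI-6173 (an assumption, not something verified in this PR) is that the Scala wrapper's hashCode converted whole values to an integral form, which materializes every digit. The `DecimalHashSketch` object and its `describe` helper are hypothetical names for this sketch only.

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger}

object DecimalHashSketch {
  // A value with a very large exponent, stored compactly as unscaledValue * 10^(-scale).
  val huge: JBigDecimal = new JBigDecimal("1E1000000000")

  // Only touches the compact representation; never expands the value to its full digits.
  def describe(d: JBigDecimal): String =
    s"unscaled=${d.unscaledValue} scale=${d.scale} precision=${d.precision} hash=${d.hashCode}"
}
```

The sketch deliberately avoids conversions such as longValue or toBigInteger on `huge`, since those would need to expand the value.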

Expression issues:

- ~~UTF8String.repeat can throw NegativeArraySizeException when applied to random bytes which have been cast to a string.~~ This is caused by extreme array sizes which overflow the maximum Int value.
- UTF8String.reverse can throw ArrayIndexOutOfBoundsException when applied to random bytes which have been cast to a string.
- ✅ The methods in the Unevaluable trait should be final and some of the new aggregate functions should inherit from this trait ([SPARK-9286] [SQL] Methods in Unevaluable should be final and AlgebraicAggregate should extend Unevaluable. #7627).
- For extremely small inputs, the results of the Remainder expression can differ in the codegen and non-codegen paths:
```
(CAST(-2147483648, FloatType) % -1.8938038E-30) (types: List(FloatType, FloatType) [-4.0832423E-31] did not equal [-8.263847E-31]
```
This is most likely a numeric stability issue.
- Code generation frequently crashes for expressions containing null literals, but this isn't a problem that will impact users due to our codegen fallback path.
- ✅ NaNvl should check that its two arguments are of the same floating point type: [SPARK-9549][SQL] fix bugs in expressions #7882
- ✅ Code-generated numeric comparison expressions may fail to compile for Boolean types: [SPARK-9549][SQL] fix bugs in expressions #7882
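
The NegativeArraySizeException noted for UTF8String.repeat above is the kind of failure that silent Int overflow in a size computation produces. The sketch below is not UTF8String's actual implementation; `RepeatOverflowSketch` and both of its methods are hypothetical, and it assumes the output size is computed as the byte length times the repeat count.

```scala
object RepeatOverflowSketch {
  // Naive size computation: Int multiplication wraps around silently on overflow.
  def naiveRepeatedSize(numBytes: Int, times: Int): Int = numBytes * times

  // Checked variant: widen to Long first, then reject sizes that do not fit in an Int.
  def checkedRepeatedSize(numBytes: Int, times: Int): Option[Int] = {
    val size = numBytes.toLong * times.toLong
    if (size < 0 || size > Int.MaxValue) None else Some(size.toInt)
  }
}
```

For example, `naiveRepeatedSize(2, Int.MaxValue)` wraps to a negative number, and allocating an array of that size throws NegativeArraySizeException, whereas `checkedRepeatedSize(2, Int.MaxValue)` returns None so the caller can report a clearer error.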

Minor UX issues:

The ORC writer could log a more informative error message when the user isn't using a HiveContext:

java.lang.ClassNotFoundException: org.apache.spark.sql.hive.orc.DefaultSource
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ddl.scala:206)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ddl.scala:313)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)

Confusing unresolved alias errors are thrown somewhat later in analysis than I'd expect. Ideally we would never see "UnresolvedException: Invalid call to dataType on unresolved object", since we would have checked for resolution before inspecting the data types.
- dropDuplicates seems especially prone to this problem.

SparkQA · 2015-07-23T20:40:29Z

Test build #38263 has finished for PR 7625 at commit 4c5dc9c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ExpressionFuzzingSuite extends SparkFunSuite with Logging

JoshRosen · 2015-07-24T00:09:58Z

I just pushed a commit which adds some randomized tests of the DataFrame API and it appears to have uncovered some runtime crashes for some simple queries. Going to investigate to try to find some deterministic minimal reproductions.

SparkQA · 2015-07-24T00:16:49Z

Test build #38283 has finished for PR 7625 at commit 133b27a.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class AlgebraicAggregate extends AggregateFunction2 with Serializable with Unevaluable
- abstract class AggregateFunction1 extends LeafExpression with Serializable

SparkQA · 2015-07-24T01:18:25Z

Test build #38289 has finished for PR 7625 at commit dd16f4d.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class AlgebraicAggregate extends AggregateFunction2 with Serializable with Unevaluable
- abstract class AggregateFunction1 extends LeafExpression with Serializable

SparkQA · 2015-07-24T03:36:32Z

Test build #38302 has finished for PR 7625 at commit 37e4ce8.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ChangePrecision(child: Expression) extends UnaryExpression
- abstract class AlgebraicAggregate extends AggregateFunction2 with Serializable with Unevaluable
- abstract class AggregateFunction1 extends LeafExpression with Serializable
- abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
- case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
- case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
- case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
- case class DecimalType(precision: Int, scale: Int) extends FractionalType
- case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion

SparkQA · 2015-08-16T22:06:16Z

Test build #41000 has finished for PR 7625 at commit 0c7e9d0.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
- case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
- case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
- case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)

SparkQA · 2016-08-16T00:49:20Z

Test build #63813 has finished for PR 7625 at commit 7664e37.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-16T00:54:14Z

Test build #63815 has finished for PR 7625 at commit d1d3d53.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen changed the title ~~[WIP][CI-SKIP] Expression fuzz testing in Spark SQL~~ [WIP] [skip ci] Expression fuzz testing in Spark SQL Jul 23, 2015

JoshRosen changed the title ~~[WIP] [skip ci] Expression fuzz testing in Spark SQL~~ [WIP] [skip ci] Fuzz testing in Spark SQL Jul 23, 2015

JoshRosen force-pushed the fuzz-test branch from 4c5dc9c to cb71db0 Compare July 24, 2015 00:09

JoshRosen added 16 commits July 23, 2015 18:13

Fix SPARK-9292.

6f2b909

Check condition type in resolved()

03120d5

Initial commit for SQL expression fuzzing harness

e1f462e

Apply implicit casts (in a hacky way for now)

f8daec7

More messy WIP prototyping on expression fuzzing

df00e7a

Add some comments; speed up classpath search

2dcbc10

Move dummy type coercion to a helper method

c20a679

More code cleanup and comments

95860de

Use non-mutable interpreted projection.

abaed51

Log expression after coercion

129ad6c

Run tests in deterministic order

e1f91df

Test with random inputs of all types

adc3c7f

Ignore BinaryType for now, since it led to some spurious failures.

ae5e151

Begin to add a DataFrame API fuzzer.

a354208

Don't puts nulls into the DataFrame

13f8c56

Print logical plans.

dd16f4d

JoshRosen force-pushed the fuzz-test branch from 133b27a to dd16f4d Compare July 24, 2015 01:14

JoshRosen added 4 commits July 23, 2015 19:33

Fuzzer improvements.

7f2b771

Fix SPARK-9293

326d759

Merge branch 'SPARK-9293' into fuzz-test

4a2c684

Support methods that take varargs Column parameters.

37e4ce8

Merge remote-tracking branch 'origin/master' into fuzz-test

558f04a

JoshRosen added 4 commits July 28, 2015 16:15

Fix long line.

63492c4

Merge branch 'unsafe-row-null-fixes' into unsafe-by-default

6dc34f4

Merge branch 'unsafe-by-default' into fuzz-test

6ac2d82

Merge remote-tracking branch 'origin/master' into fuzz-test

18615a6

JoshRosen mentioned this pull request Aug 1, 2015

[SPARK-9526][SQL]Utilize randomized tests to reveal potential bugs in sql expressions #7855

Closed

JoshRosen added 5 commits August 14, 2015 13:42

Merge remote-tracking branch 'origin/master' into fuzz-test

704abc1

Merge remote-tracking branch 'origin/master' into fuzz-test

b549b3e

Fix compilation with latest master.

ca8168a

Update to ignore some new analysis exceptions.

0c7e9d0

Move RandomDataFrameGenerator to own file.

fb0671f

JoshRosen added 4 commits August 16, 2015 17:23

WIP

78a71af

Filter failing BinaryType array test.

3b06849

Merge remote-tracking branch 'origin/master' into fuzz-test

574130b

WIP

a4c9b33

JoshRosen closed this Sep 2, 2015

JoshRosen added 7 commits May 27, 2016 12:16

Merge remote-tracking branch 'origin/master' into fuzz-test

d36f8f5

Also ignore BRound expression.

ae5055a

Fix serializability.

dfdab5e

Merge remote-tracking branch 'origin/master' into fuzz-test

bb4cc2a

More input type validation.

e60f231

Updates for DataSet API.

94087cb

Merge remote-tracking branch 'origin/master' into fuzz-test

d1d3d53

JoshRosen reopened this Aug 16, 2016

JoshRosen force-pushed the fuzz-test branch from 7664e37 to d1d3d53 Compare August 16, 2016 00:48

JoshRosen closed this Aug 18, 2016

JoshRosen deleted the fuzz-test branch August 18, 2016 22:27

joshrosen-stripe restored the fuzz-test branch August 30, 2019 00:42
